Open Data Station

In this notebook we present a short example of how to use Python to perform data analysis and visualization.

Tool selected for this study: Python

Why Python?

  • It is considered a good language for beginning programmers.
  • The learning curve is gentle and gradual.
  • It emphasizes productivity and code readability.
  • It has a fast-growing community and support.
  • Developers can also use it for scripting a website or other applications.
  • Its data science libraries are developing quickly.

Case Study:

UNHCR Refugee Data: Data on Uprooted Populations and Asylum Processing

Data Source: Kaggle

Context:

The mass movement of uprooted people is a highly charged geopolitical issue. This data, gathered by the UN High Commissioner for Refugees (UNHCR), covers the movement of displaced persons (asylum seekers, refugees, internally displaced persons (IDPs), and stateless persons). Also included are destination countries' responses to asylum petitions.

Content:

This dataset includes six CSV files covering:

  • Asylum monthly applications opened (asylum_seekers_monthly.csv)
  • Yearly progress through the refugee system (asylum_seekers.csv)
  • Refugee demographics (demographics.csv)
  • Yearly time series data on UNHCR’s populations of concern (time_series.csv)
  • Yearly population statistics on refugees by residence and destination (persons_of_concern.csv)
  • Yearly data on resettlement arrivals, with or without UNHCR assistance (resettlement.csv)

Asylum Monthly Applications

Description: (1999-2016) Monthly totals of asylum applications opened in 38 European and 6 non-European countries, by month and origin. Repeat, reopened, and appealed applications are largely excluded.

Prepare Notebook

In [1]:
# Numpy is a library for numeric analysis.
import numpy as np
# Pandas is used for data analysis and data frame manipulation.
import pandas as pd
# NetworkX is a library for network analysis.
import networkx as nx
# Matplotlib is used for visualization.
import matplotlib.pyplot as plt
# Seaborn is another visualization library built on top of Matplotlib.
import seaborn as sns
sns.set()

# Plotly is used for dynamic visualization.
# Cufflinks binds Plotly to pandas data frames (it provides the .iplot method).
import cufflinks as cf
# Run Plotly in offline mode.
cf.go_offline()

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Read Data

Download the data from Kaggle and save the folder refugee-data inside the data folder of the GitHub repository.

In [2]:
# Read the .csv file.
# We just indicate the path where the file is stored.
raw_data_df = pd.read_csv('../data/refugee-data/asylum_seekers_monthly.csv')

# Let us see a sample of the data.
raw_data_df.tail(20)
Out[2]:
        Country / territory of asylum/residence  Origin    Year  Month      Value
332169  USA (INS/DHS)                            Zambia    2017  January    *
332170  USA (INS/DHS)                            Zambia    2017  February   *
332171  USA (INS/DHS)                            Zambia    2017  March      *
332172  USA (INS/DHS)                            Zambia    2017  April      *
332173  USA (INS/DHS)                            Zimbabwe  2016  January    13
332174  USA (INS/DHS)                            Zimbabwe  2016  February   11
332175  USA (INS/DHS)                            Zimbabwe  2016  March      18
332176  USA (INS/DHS)                            Zimbabwe  2016  April      11
332177  USA (INS/DHS)                            Zimbabwe  2016  May        28
332178  USA (INS/DHS)                            Zimbabwe  2016  June       18
332179  USA (INS/DHS)                            Zimbabwe  2016  July       11
332180  USA (INS/DHS)                            Zimbabwe  2016  August     12
332181  USA (INS/DHS)                            Zimbabwe  2016  September  22
332182  USA (INS/DHS)                            Zimbabwe  2016  October    22
332183  USA (INS/DHS)                            Zimbabwe  2016  November   35
332184  USA (INS/DHS)                            Zimbabwe  2016  December   28
332185  USA (INS/DHS)                            Zimbabwe  2017  February   27
332186  USA (INS/DHS)                            Zimbabwe  2017  March      42
332187  USA (INS/DHS)                            Zimbabwe  2017  April      16
332188  USA (INS/DHS)                            Zimbabwe  2017  May        12
In [3]:
# Get dimensions of the data set.
raw_data_df.shape
Out[3]:
(332189, 5)

Data Cleaning & Formatting

In [4]:
data_df = raw_data_df.copy()

# Replace the '*' placeholder with NaN.
data_df.replace(to_replace={'*': np.nan}, inplace=True)

# Change column names.
data_df.rename(columns = {'Country / territory of asylum/residence' : 'Country'}, inplace=True)

# Convert the Year variable to string.
data_df['Year'] = data_df['Year'].astype('str')

# Include a Date column (first day of each month).
data_df['Date'] = data_df[['Year', 'Month']].apply(lambda x: '-'.join(x), axis=1)
data_df['Date'] = pd.to_datetime(data_df['Date'], format = '%Y-%B')

# Convert the Value variable to numeric (it stays float because of the NaN values).
data_df['Value'] = pd.to_numeric(data_df['Value'], downcast='integer')

data_df.head()
Out[4]:
   Country    Origin       Year  Month     Value  Date
0  Australia  Afghanistan  1999  January   8.0    1999-01-01
1  Australia  Afghanistan  1999  February  10.0   1999-02-01
2  Australia  Afghanistan  1999  March     25.0   1999-03-01
3  Australia  Afghanistan  1999  April     25.0   1999-04-01
4  Australia  Afghanistan  1999  May       7.0    1999-05-01
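
The cleaning steps can be checked on a tiny frame before running them on the full file. The sketch below uses made-up rows (toy_df and its values are hypothetical, not taken from the dataset) and swaps in errors='coerce' as an alternative to the explicit '*' replacement; it assumes every non-numeric entry should become a missing value.

```python
import pandas as pd

# Made-up rows that mimic the raw file layout (illustration only).
toy_df = pd.DataFrame({
    'Year': [2016, 2016, 2017],
    'Month': ['January', 'February', 'March'],
    'Value': ['13', '*', '42'],
})

# errors='coerce' turns anything non-numeric (here '*') into NaN.
toy_df['Value'] = pd.to_numeric(toy_df['Value'], errors='coerce')

# Build a first-of-month timestamp from the Year and Month columns.
toy_df['Date'] = pd.to_datetime(
    toy_df['Year'].astype(str) + '-' + toy_df['Month'],
    format='%Y-%B'
)

# Share of rows whose value is missing after cleaning.
missing_share = toy_df['Value'].isna().mean()
```

Checking missing_share on the real data is a quick way to see how much of the Value column was hidden behind the placeholder.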

Exploratory Data Analysis

In [5]:
data_df['Country'].unique()
Out[5]:
array(['Australia', 'Austria', 'Belgium', 'Bulgaria', 'Canada',
       'Czech Rep.', 'Denmark', 'Finland', 'France', 'Germany', 'Greece',
       'Hungary', 'Ireland', 'Liechtenstein', 'Luxembourg', 'Netherlands',
       'Norway', 'Poland', 'Portugal', 'Rep. of Korea', 'Romania',
       'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey',
       'United Kingdom of Great Britain and Northern Ireland',
       'USA (EOIR)', 'New Zealand', 'USA (INS/DHS)', 'Cyprus', 'Iceland',
       'Japan', 'Croatia', 'Estonia', 'Latvia', 'Malta',
       'Serbia and Kosovo: S/RES/1244 (1999)', 'Lithuania', 'Albania',
       'Montenegro', 'The former Yugoslav Rep. of Macedonia',
       'Bosnia and Herzegovina', 'Italy'], dtype=object)
In [6]:
data_df['Origin'].unique()
Out[6]:
array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Belarus', 'Belgium',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Brazil', 'Bulgaria', 'Burundi', "Côte d'Ivoire", 'Cambodia',
       'Cameroon', 'Canada', 'Chile', 'China', 'China, Hong Kong SAR',
       'Colombia', 'Congo', 'Croatia', 'Cuba', 'Czech Rep.',
       "Dem. People's Rep. of Korea", 'Dem. Rep. of the Congo', 'Denmark',
       'Dominica', 'Ecuador', 'Egypt', 'El Salvador', 'Estonia',
       'Ethiopia', 'Fiji', 'France', 'Georgia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Hungary', 'India', 'Indonesia',
       'Iran (Islamic Rep. of)', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kuwait', 'Kyrgyzstan',
       "Lao People's Dem. Rep.", 'Latvia', 'Lebanon', 'Liberia', 'Libya',
       'Lithuania', 'Malawi', 'Malaysia', 'Mexico', 'Mongolia', 'Morocco',
       'Mozambique', 'Myanmar', 'Nepal', 'Netherlands', 'Nicaragua',
       'Nigeria', 'Oman', 'Pakistan', 'Palestinian', 'Papua New Guinea',
       'Peru', 'Philippines', 'Poland', 'Portugal', 'Rep. of Korea',
       'Rep. of Moldova', 'Romania', 'Russian Federation', 'Rwanda',
       'Samoa', 'Saudi Arabia', 'Senegal',
       'Serbia and Kosovo: S/RES/1244 (1999)', 'Sierra Leone',
       'Singapore', 'Slovenia', 'Somalia', 'South Africa', 'Spain',
       'Sri Lanka', 'Stateless', 'Sudan', 'Swaziland', 'Syrian Arab Rep.',
       'Tajikistan', 'Thailand', 'The former Yugoslav Rep. of Macedonia',
       'Togo', 'Tonga', 'Tunisia', 'Turkey', 'Tuvalu', 'Uganda',
       'Ukraine', 'United Arab Emirates',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Rep. of Tanzania', 'United States of America', 'Uruguay',
       'Uzbekistan', 'Various/unknown',
       'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Yemen', 'Benin',
       'Bhutan', 'Eritrea', 'Gabon', 'Gambia', 'Guinea-Bissau', 'Jamaica',
       'Mali', 'Mauritania', 'Niger', 'Slovakia', 'Zimbabwe',
       'Burkina Faso', 'Cabo Verde', 'Central African Rep.', 'Chad',
       'Comoros', 'Djibouti', 'Equatorial Guinea', 'Haiti', 'Namibia',
       'Turkmenistan', 'Zambia', 'Antigua and Barbuda', 'Bahamas',
       'Barbados', 'Belize', 'Botswana', 'Costa Rica', 'Cyprus',
       'Dominican Rep.', 'Grenada', 'Guyana', 'Honduras', 'Kiribati',
       'Lesotho', 'Madagascar', 'Maldives', 'Malta', 'Panama', 'Paraguay',
       'Saint Lucia', 'Saint Vincent and the Grenadines',
       'Sao Tome and Principe', 'Seychelles', 'Sweden', 'Switzerland',
       'Trinidad and Tobago', 'Norway', 'French Guiana', 'Liechtenstein',
       'Mauritius', 'Tibetan', 'Finland', 'Suriname',
       'Turks and Caicos Islands', 'New Zealand', 'Montenegro',
       'Solomon Islands', 'Western Sahara', 'China, Macao SAR',
       'Gibraltar', 'Martinique', 'New Caledonia',
       'Saint Kitts and Nevis', 'Luxembourg', 'Qatar', 'Nauru', 'Andorra',
       'Cayman Islands', 'Iceland', 'Monaco', 'Timor-Leste', 'Bermuda',
       'San Marino', 'Brunei Darussalam', 'Puerto Rico', 'South Sudan',
       'Micronesia (Federated States of)'], dtype=object)

Let us plot the number of applications received per country.

In [7]:
fig, ax= plt.subplots(figsize=(15, 15))

data_df \
    .groupby('Country', as_index=False) \
    .agg({'Value' : np.sum}) \
    .sort_values('Value', ascending=True) \
    .plot(kind='barh',
        x='Country', 
        y='Value', 
        color='blue',
        ax=ax
    )

ax.set(title='Target Countries Asylum Applications (1999-2016)');
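
To see what the aggregation behind this bar chart produces, here is a minimal sketch on made-up rows (toy_df is hypothetical). It uses the string alias 'sum' for the aggregation and shows that missing values are skipped when the totals are computed.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same long format as data_df (illustration only).
toy_df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'France', 'France', 'Italy'],
    'Origin': ['Iraq', 'Syrian Arab Rep.', 'Iraq', 'Albania', 'Iraq'],
    'Value': [10.0, np.nan, 4.0, 3.0, 5.0],
})

# One row per destination country with the total number of applications.
totals_df = (
    toy_df
    .groupby('Country', as_index=False)
    .agg({'Value': 'sum'})   # NaN entries are ignored by the sum
    .sort_values('Value', ascending=True)
)
```

Sorting in ascending order matters for the horizontal bar chart: the first row is drawn at the bottom, so the largest total ends up at the top.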

Similarly, we plot the 50 countries of origin with the most applications.

In [8]:
fig, ax= plt.subplots(figsize=(15, 15))

data_df \
    .groupby('Origin', as_index=False) \
    .agg({'Value' : np.sum}) \
    .sort_values('Value', ascending=True) \
    .tail(50) \
    .plot(kind='barh',
        x='Origin', 
        y='Value', 
        color='red',
        ax=ax
    )

ax.set(title='Origin Countries Asylum Applications (1999-2016)');

Dynamic Visualization

Let us focus on Country=Germany.

In [9]:
data_df \
    .query('Country == "Germany"') \
    .groupby('Date') \
    .agg({'Value': np.sum}) \
    .iplot(
        title='Number of Applications to Germany', 
        xTitle='Date', 
        color='blue'
    )

We can also get the country ranking with respect to received applications.

In [10]:
top_in_countries = data_df \
    .groupby('Country', as_index=False) \
    .agg({'Value' : np.sum}) \
    .sort_values('Value', ascending=False) \
    .head(3) \
    ['Country'] \
    .values
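
The same top-N selection can also be written with Series.nlargest, which avoids the explicit sort. This is a sketch on made-up numbers (toy_df and top_countries are hypothetical names), not a replacement for the cell above.

```python
import pandas as pd

# Made-up totals per destination country (illustration only).
toy_df = pd.DataFrame({
    'Country': ['Germany', 'France', 'Italy', 'Sweden', 'Germany'],
    'Value': [10.0, 7.0, 5.0, 8.0, 2.0],
})

# Sum per country, then keep the names of the three largest totals.
top_countries = (
    toy_df
    .groupby('Country')['Value']
    .sum()
    .nlargest(3)
    .index
    .to_numpy()
)

# Keep only the rows that belong to those countries.
filtered_df = toy_df[toy_df['Country'].isin(top_countries)]
```
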
In [11]:
data_df \
    .query('Country in @top_in_countries') \
    .groupby(['Country', 'Date']) \
    .agg({'Value': np.sum}) \
    .unstack(0) \
    ['Value'] \
    .iplot(
        title='Number of Applications to Top 3 Countries', 
        xTitle='Date'
    )
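
The groupby and unstack steps reshape the data from long format (one row per country and date) to wide format (one column per country), which is the layout a multi-line plot expects. A minimal sketch on made-up rows (toy_df and wide_df are hypothetical names):

```python
import pandas as pd

# Made-up long-format rows (illustration only).
toy_df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'France', 'France', 'Germany'],
    'Date': pd.to_datetime([
        '2016-01-01', '2016-02-01', '2016-01-01', '2016-02-01', '2016-01-01'
    ]),
    'Value': [10.0, 20.0, 3.0, 4.0, 5.0],
})

wide_df = (
    toy_df
    .groupby(['Country', 'Date'])
    .agg({'Value': 'sum'})
    .unstack(0)   # move the Country index level into the columns
    ['Value']     # drop the outer 'Value' column level
)
```

After this step wide_df is indexed by Date and has one column per country, so each column becomes one line in the plot.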

Let us rank the origin-destination pairs by number of applications to see which countries apply the most to Germany:

In [12]:
germany_df = data_df \
    .groupby(['Origin', 'Country'], as_index=False) \
    .agg({'Value': np.sum}) \
    .sort_values('Value', ascending=False)

germany_df.head()
Out[12]:
      Origin                                Country  Value
4698  Syrian Arab Rep.                      Germany  517479.0
2358  Iraq                                  Germany  233882.0
14    Afghanistan                           Germany  230313.0
2385  Iraq                                  Turkey   215463.0
4172  Serbia and Kosovo: S/RES/1244 (1999)  Germany  204735.0

We see that Syrian Arab Rep. is the top one. Let us see the applications just for Syrian Arab Rep. as a time series:

In [13]:
data_df \
    .query('Country == "Germany"') \
    .query('Origin == "Syrian Arab Rep."') \
    .query('Date > "2009-12-31"') \
    .groupby('Date') \
    .agg({'Value': np.sum}) \
    .iplot(
        title='Number of Applications to Germany from Syrian Arab Rep.', 
        xTitle='Date', 
        color='blue'
    )

We see a sharp increase in the period 2015-2016.
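
One way to put a number on such an increase is to aggregate the monthly series by year and compute the year-over-year change. The sketch below uses made-up monthly counts (monthly, yearly and growth are hypothetical names and values, not results from the dataset).

```python
import pandas as pd

# Made-up monthly counts over three years (illustration only).
dates = pd.date_range('2014-01-01', periods=36, freq='MS')
monthly = pd.Series([100] * 12 + [300] * 12 + [600] * 12, index=dates)

# Total per calendar year.
yearly = monthly.groupby(monthly.index.year).sum()

# Relative change with respect to the previous year (first year is NaN).
growth = yearly.pct_change()
```

With the real data, monthly would be the Value column of the grouped Germany and Syrian Arab Rep. series, indexed by Date.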

Question: Could we have predicted this?

Let us see if we can find any indicator using Google Trends.

In [14]:
# This library allows us to connect to Google Trends directly.
from pytrends.request import TrendReq

pytrends = TrendReq(tz=360)

# Let us get the Google search index data for the word "Germany" in Arabic.
kw_list = ['ألمانيا']

pytrends.build_payload(
    kw_list=kw_list, 
    cat=0, 
    timeframe='2010-01-01 2018-01-01', 
    geo='SY', 
    gprop=''
)

google_trends_data = pytrends.interest_over_time()

syria_germany_search_df = google_trends_data.reset_index()
syria_germany_search_df.head()
Out[14]:
date ألمانيا isPartial
0 2010-01-01 7 False
1 2010-02-01 13 False
2 2010-03-01 19 False
3 2010-04-01 10 False
4 2010-05-01 11 False
In [15]:
# Data processing. 
syria_germany_search_df = syria_germany_search_df \
    .replace(to_replace={'<1': 0}) \
    .assign(date = lambda x : pd.to_datetime(x['date'], format = '%Y-%m-%d')) \
    .assign(ألمانيا = lambda x : pd.to_numeric(x['ألمانيا'], downcast = 'integer'))

syria_germany_search_df.head()
Out[15]:
date ألمانيا isPartial
0 2010-01-01 7 False
1 2010-02-01 13 False
2 2010-03-01 19 False
3 2010-04-01 10 False
4 2010-05-01 11 False
In [16]:
syria_germany_search_df \
    .query('date < "2017-06-01"') \
    .iplot(
        x='date', 
        y='ألمانيا', 
        title = 'Search Data for the Keyword : Germany (ألمانيا) in Syria', 
        color = 'red'
)

We see a remarkable peak in searches in the summer of 2014.
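
A simple way to explore whether a search signal leads the applications is to correlate the two monthly series at different lags. The sketch below is built on synthetic data in which the lag is fixed at three months by construction (search, applications and best_lag are hypothetical names); it only illustrates the mechanics and says nothing about the real series.

```python
import numpy as np
import pandas as pd

# Synthetic monthly search index (illustration only).
dates = pd.date_range('2013-01-01', periods=36, freq='MS')
rng = np.random.RandomState(0)
search = pd.Series(rng.rand(36), index=dates)

# Synthetic applications that follow the search index three months later.
applications = search.shift(3) * 1000

# Correlation between applications and the search index delayed by each lag.
lag_corr = {
    lag: search.shift(lag).corr(applications)
    for lag in range(0, 7)
}

# Lag (in months) with the highest correlation.
best_lag = max(lag_corr, key=lag_corr.get)
```

A high correlation at a positive lag is only a hint that searches move first; it is not evidence that they can predict applications.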

Network Visualization

Let us represent the top applications to Germany as a weighted network object:

  • The nodes are the countries.
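  • An origin country is connected to a destination country if applications were filed from the first to the second; the edge weight is the total number of applications.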

Let us work on the Germany network.

In [17]:
network_df = data_df \
    .groupby(['Origin', 'Country'], as_index=False) \
    .agg({'Value': np.sum}) \
    .sort_values('Value', ascending=False)

plot_network_df = network_df \
    .query('Country == "Germany"') \
    .head(50)

Let us encode the data as a network:

In [18]:
G = nx.from_pandas_edgelist(
    df=plot_network_df, 
    source = 'Origin',
    target = 'Country', 
    edge_attr = 'Value', 
    create_using = nx.DiGraph()
)

Now we plot the network (and save it as a PDF).

In [19]:
fig, ax= plt.subplots(figsize=(35, 35))

nodes_size = [x[1]/50 for x in list(G.degree(weight='Value'))]

pos = nx.spring_layout(G, iterations=500)

nx.draw(
    G=G,
    pos=pos,
    with_labels=True, 
    arrows=True, 
    node_color='#a2cffe',
    width=0.7,
    edge_color='.4',
    font_size=15,
    font_color='black', 
    node_size=nodes_size,
    ax=ax
)

plt.savefig('../plots/de_network_plot.pdf')
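
The node sizes in this plot come from the weighted degree: for a directed graph, G.degree(weight='Value') adds up the weights of all incoming and outgoing edges of each node. The same quantity can be reproduced from the edge list with pandas, as in this sketch on made-up edges (edges_df and its values are hypothetical).

```python
import pandas as pd

# Made-up edge list: applications from Origin to Country (illustration only).
edges_df = pd.DataFrame({
    'Origin': ['Iraq', 'Albania', 'Iraq'],
    'Country': ['Germany', 'Germany', 'Turkey'],
    'Value': [100.0, 50.0, 80.0],
})

# Total weight leaving each origin and arriving at each destination.
out_strength = edges_df.groupby('Origin')['Value'].sum()
in_strength = edges_df.groupby('Country')['Value'].sum()

# Weighted degree = incoming + outgoing weight (missing side counts as 0).
weighted_degree = out_strength.add(in_strength, fill_value=0)

# Same scaling as in the plot: divide by 50 to get a node size.
node_sizes = (weighted_degree / 50).to_dict()
```

Because the plotted graph only contains edges into Germany, its node carries the sum of all edge weights and is drawn much larger than any origin node.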

We can also generate a more complex network:

In [20]:
network_df = data_df \
    .groupby(['Origin', 'Country'], as_index=False) \
    .agg({'Value': np.sum}) \
    .sort_values('Value', ascending=False)

plot_network_df = network_df \
    .head(300)

G = nx.from_pandas_edgelist(
    df=plot_network_df, 
    source = 'Origin',
    target = 'Country', 
    edge_attr = 'Value', 
    create_using = nx.DiGraph()
)

fig, ax= plt.subplots(figsize=(100, 100))

nodes_size = [x[1]/600 for x in list(G.degree(weight='Value'))]

pos = nx.spring_layout(G, iterations=500)

nx.draw(
    G=G,
    pos=pos,
    with_labels=True, 
    arrows=True, 
    node_color='#a2cffe',
    width=0.7,
    edge_color='.4',
    font_size=15,
    font_color='black', 
    node_size=nodes_size,
    ax=ax
)

plt.savefig('../plots/all_network_plot.pdf')